pRactice corner: Tidy Tuesday Bird Bath Data

lruolin

Background

This is a post reproducing Julia Silge’s tutorial on logistic regression to predict the probablity of seeing birds at bird baths, using the tidymodels framework.

Learning points

how to fit logistic regression with interaction, using tidymodels framework
how to run cross-validation in parallel
must always remember to use recipe to change outcome variable to factor
using strata for imbalanced outcome
practicing using tidyeval for ggplots for exploratory data analysis

Packages

library(pacman)
p_load(tidyverse, tidymodels, ggsci, janitor, tidytuesdayR, skimr,
       patchwork)

Load data

The data is from Tidy Tuesday:

# to download data
bird_baths <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-08-31/bird_baths.csv')

bird_baths %>% 
  head()

# A tibble: 6 × 5
  survey_year urban_rural bioregions       bird_type        bird_count
        <dbl> <chr>       <chr>            <chr>                 <dbl>
1        2014 Urban       South Eastern Q… Bassian Thrush            0
2        2014 Urban       South Eastern Q… Chestnut-breast…          0
3        2014 Urban       South Eastern Q… Wild Duck                 0
4        2014 Urban       South Eastern Q… Willie Wagtail            0
5        2014 Urban       South Eastern Q… Regent Bowerbird          0
6        2014 Urban       South Eastern Q… Rufous Fantail            0

From the documentation:

urban areas: with population of 1,000 = 100,000
rural areas: with population of < 999 people

EDA

bird_baths %>% 
  skim()

Table 1: Data summary
Name	Piped data
Number of rows	161057
Number of columns	5
_______________________
Column type frequency:
character	3
numeric	2
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
urban_rural	169	1	5	5	2
bioregions	169	1	12	24	10
bird_type	0	1	4	28	169

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
survey_year	169	1	2014.45	0.50	2014	2014	2014	2015	2015	▇▁▁▁▆
bird_count	0	1	0.07	2.01	0	0	0	0	292	▇▁▁▁▁

This is a very long dataset: 161057 rows by 5 variables. There are some missing data for urban_rural and bioregions. Change all year to factor, all character to factor, and bird count to factor (since 0 = not seen, 1 = seen)

glimpse(bird_baths)

Rows: 161,057
Columns: 5
$ survey_year <dbl> 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, …
$ urban_rural <chr> "Urban", "Urban", "Urban", "Urban", "Urban", "Ur…
$ bioregions  <chr> "South Eastern Queensland", "South Eastern Queen…
$ bird_type   <chr> "Bassian Thrush", "Chestnut-breasted Mannikin", …
$ bird_count  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …

bird_baths %>% 
  filter(is.na(bioregions))

# A tibble: 169 × 5
   survey_year urban_rural bioregions bird_type             bird_count
         <dbl> <chr>       <chr>      <chr>                      <dbl>
 1          NA <NA>        <NA>       Bassian Thrush                 4
 2          NA <NA>        <NA>       Chestnut-breasted Ma…          9
 3          NA <NA>        <NA>       Wild Duck                      9
 4          NA <NA>        <NA>       Willie Wagtail               114
 5          NA <NA>        <NA>       Regent Bowerbird              11
 6          NA <NA>        <NA>       Rufous Fantail                19
 7          NA <NA>        <NA>       Spiny-cheeked Honeye…          4
 8          NA <NA>        <NA>       Flame Robin                    1
 9          NA <NA>        <NA>       European Goldfinch             6
10          NA <NA>        <NA>       Noisy Friarbird               34
# … with 159 more rows

bird_baths %>% 
  filter(is.na(survey_year))

# A tibble: 169 × 5
   survey_year urban_rural bioregions bird_type             bird_count
         <dbl> <chr>       <chr>      <chr>                      <dbl>
 1          NA <NA>        <NA>       Bassian Thrush                 4
 2          NA <NA>        <NA>       Chestnut-breasted Ma…          9
 3          NA <NA>        <NA>       Wild Duck                      9
 4          NA <NA>        <NA>       Willie Wagtail               114
 5          NA <NA>        <NA>       Regent Bowerbird              11
 6          NA <NA>        <NA>       Rufous Fantail                19
 7          NA <NA>        <NA>       Spiny-cheeked Honeye…          4
 8          NA <NA>        <NA>       Flame Robin                    1
 9          NA <NA>        <NA>       European Goldfinch             6
10          NA <NA>        <NA>       Noisy Friarbird               34
# … with 159 more rows

bird_baths %>% 
  count(bird_count)

# A tibble: 65 × 2
   bird_count      n
        <dbl>  <int>
 1          0 155344
 2          1   5570
 3          2     12
 4          3     12
 5          4      9
 6          5      7
 7          6      4
 8          7      6
 9          8      6
10          9      8
# … with 55 more rows

bird_bath_working <- bird_baths %>% 
  mutate(across(.cols = survey_year:bird_type , as.factor)) %>% 
  na.exclude() # remove NA

glimpse(bird_bath_working)

Rows: 160,888
Columns: 5
$ survey_year <fct> 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, …
$ urban_rural <fct> Urban, Urban, Urban, Urban, Urban, Urban, Urban,…
$ bioregions  <fct> South Eastern Queensland, South Eastern Queensla…
$ bird_type   <fct> Bassian Thrush, Chestnut-breasted Mannikin, Wild…
$ bird_count  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …

Reusing my ggplot code from previous post:

# Create function:

plot_fct_boxplot <- function(var_x) {
  bird_bath_working %>% 
  mutate(var_x_fct = factor({{var_x}})) %>%  # so that other classes can be used for plot as well
  ggplot(aes(var_x_fct)) +
  geom_bar(aes(fill = var_x_fct), show.legend = F) +
  geom_text(stat = "count", aes(label = ..count..), hjust = -0.1) +
  scale_fill_jco() + # from ggsci package
  scale_y_continuous(expand = expansion(mult = c(0, 0.2))) +
  theme_classic() +
  labs(title = str_to_title(as_label(enquo(var_x))),
       x = as_label(enquo(var_x))) +
  theme(axis.title = element_text(face = "bold"),
        axis.text.x = element_text(face = "bold")) +
    coord_flip()
}

# Set names to loop through:

var_fct <- bird_bath_working %>% 
  select(survey_year, urban_rural, bioregions, bird_count) %>% 
  names()


plots <- var_fct %>% 
  syms() %>% 
  map(function(var) plot_fct_boxplot({{var}}))


(plots[[1]] + plots[[2]] + plots[[4]]) / plots[[3]]

Bird_count is very imbalanced. Most of the observations come from urban area (about 70%). Most of the data is from Sydney Basin.

glimpse(bird_bath_working)

Rows: 160,888
Columns: 5
$ survey_year <fct> 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, …
$ urban_rural <fct> Urban, Urban, Urban, Urban, Urban, Urban, Urban,…
$ bioregions  <fct> South Eastern Queensland, South Eastern Queensla…
$ bird_type   <fct> Bassian Thrush, Chestnut-breasted Mannikin, Wild…
$ bird_count  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …

bird_bath_working %>% 
  distinct(bird_type) %>% 
  count() # there are 169 types of birds seen

# A tibble: 1 × 1
      n
  <int>
1   169

top_15_birds <- bird_bath_working %>% 
  group_by(bird_type) %>% 
  summarise(times_seen = sum(bird_count)) %>% 
  arrange(desc(times_seen)) %>% 
  slice_head(n = 15)

names_top_birds <- top_15_birds %>% 
  pull(bird_type)

bird_bath_working %>% 
  filter(bird_type %in% names_top_birds) %>% 
  group_by(bird_type, bioregions) %>% 
  summarise(times_seen = sum(bird_count), .groups = "drop") %>% 
  arrange(desc(times_seen))  %>% 
  ggplot(aes(bioregions, fct_rev(bird_type))) +
  geom_tile(aes(fill = times_seen)) +
  scale_fill_continuous(low = "snow1", high = "darkorange4") +
  labs(title = "Heatmap for the top 15 types of birds (by total number of times seen)",
       x = "Bio Regions",
       y = "Bird type",
       caption = "Source: Cleary et al, 2016") +
  theme_classic() +
  theme(axis.text = element_text(face = "bold"))

Rarest birds

bottom_15_birds <- bird_bath_working %>% 
  group_by(bird_type) %>% 
  summarise(times_seen = sum(bird_count)) %>% 
  arrange(desc(times_seen)) %>% 
  slice_tail(n = 15)

names_rare_birds <- bottom_15_birds %>% 
  pull(bird_type)

names_rare_birds

 [1] Mangrove Gerygone         Mangrove Honeyeater      
 [3] Pale-yellow Robin         Purple Swamphen          
 [5] Purple-crowned Lorikeet   Satin Flycatcher         
 [7] Song Thrush               Southern Whiteface       
 [9] Striated Pardalote        Superb Lyrebird          
[11] Tawny-crowned Honeyeater  Tete                     
[13] Varied Triller            Welcome Swallow          
[15] Red-tailed Black Cockatoo
169 Levels: Apostlebird ... Yellow-tufted Honeyeater

bird_bath_working %>% 
  filter(bird_type %in% names_rare_birds) %>% 
  group_by(bird_type, bioregions) %>% 
  summarise(times_seen = sum(bird_count), .groups = "drop") %>%
  arrange(desc(times_seen))  %>% 
  ggplot(aes(bioregions, fct_rev(bird_type))) +
  geom_tile(aes(fill = times_seen)) +
  scale_fill_continuous(low = "snow1", high = "darkorange4") +
  labs(title = "Heatmap for the rarest 15 types of birds (by total number of times seen)",
       x = "Bio Regions",
       y = "Bird type",
       caption = "Source: Cleary et al, 2016") +
  theme_classic() +
  theme(axis.text = element_text(face = "bold"))

Probability of seeing birds of different in different locations

# data that looks at av number of times of seeing birds in urban/rural region
top_15_urban_rural <- bird_bath_working %>% 
  filter(bird_type %in% names_top_birds) %>% 
  group_by(urban_rural, bird_type) %>% 
  summarise(av_times_seen = mean(bird_count), .groups = "drop") %>%  # mean is used, not sum
  arrange(desc(av_times_seen)) 

head(top_15_urban_rural)

# A tibble: 6 × 3
  urban_rural bird_type         av_times_seen
  <fct>       <fct>                     <dbl>
1 Rural       Superb Fairy-wren         0.364
2 Urban       Noisy Miner               0.354
3 Urban       Rainbow Lorikeet          0.296
4 Rural       Eastern Spinebill         0.296
5 Urban       Australian Magpie         0.283
6 Rural       Grey Fantail              0.282

# pivot wider so that can use data for geom_segment later
top_15_urban_rural_wide <- top_15_urban_rural %>% 
  pivot_wider(names_from = urban_rural,
              values_from = av_times_seen) %>% 
  arrange(desc(Urban)) %>% 
  rowid_to_column("order")

head(top_15_urban_rural_wide)

# A tibble: 6 × 4
  order bird_type          Rural Urban
  <int> <fct>              <dbl> <dbl>
1     1 Noisy Miner       0.201  0.354
2     2 Rainbow Lorikeet  0.136  0.296
3     3 Australian Magpie 0.259  0.283
4     4 Spotted Dove      0.0374 0.213
5     5 Red Wattlebird    0.119  0.207
6     6 Magpie-lark       0.112  0.193

top_15_urban_rural %>% 
 # inner join to see the ordering
  left_join(top_15_urban_rural_wide, by = "bird_type") %>% 
  ggplot(aes(av_times_seen, fct_reorder(bird_type, -order))) +
  geom_point(aes(col = urban_rural), size = 3) +

  geom_segment(
    aes(x = Rural, xend = Urban,
        y = bird_type, yend = bird_type),
    alpha = 0.7, col = "gray70", size = 1
  ) +
  
  scale_x_continuous(labels = scales::percent, 
                     limits = c(0, 0.5)) +
  scale_color_jco() +
  labs(title = "Av number of times seen for top 15 most common types of birds",
       subtitle = "Birds are arranged in descending order for probability of seeing in urban areas",
       x = "Av number of times seen",
       y = "Bird type",
       caption = "Source: Cleary et al, 2016",
       col  = NULL) +
  theme_classic() +
  theme(legend.position = "top")

Building a logistic regression model

Learning point: character variables do not work well for modelling. Must remember to change outcome variable to a factor.

glimpse(bird_bath_working)

Rows: 160,888
Columns: 5
$ survey_year <fct> 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, …
$ urban_rural <fct> Urban, Urban, Urban, Urban, Urban, Urban, Urban,…
$ bioregions  <fct> South Eastern Queensland, South Eastern Queensla…
$ bird_type   <fct> Bassian Thrush, Chestnut-breasted Mannikin, Wild…
$ bird_count  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …

bird_bath_working <- 
  bird_bath_working %>% 
  mutate(bird_seen = if_else(bird_count > 0, "bird seen", "not seen")) %>%  # outcome must be a factor!
  mutate(across(where(is.character), as.factor))

set.seed(2021101001)

bird_split <- initial_split(bird_bath_working,
                            strata = bird_count) # since there is an imbalance

# create a df for train and test data

bird_train_data <- training(bird_split) # 120,666
bird_test_data <- testing(bird_split) # 40,222

# define model:
lr_model <- logistic_reg()

lr_model

Logistic Regression Model Specification (classification)

Computational engine: glm

# create recipe: outcome and two predictors
bird_recipe <- recipe(bird_seen ~ bird_type + urban_rural, data = bird_train_data) %>% 
                step_dummy(all_nominal_predictors()) # to create dummy variables from factors


summary(bird_recipe)

# A tibble: 3 × 4
  variable    type    role      source  
  <chr>       <chr>   <chr>     <chr>   
1 bird_type   nominal predictor original
2 urban_rural nominal predictor original
3 bird_seen   nominal outcome   original

bird_recipe %>% prep()

Recipe

Inputs:

      role #variables
   outcome          1
 predictor          2

Training data contained 120666 data points and no missing data.

Operations:

Dummy variables from bird_type, urban_rural [trained]

# bundle workflow

bird_workflow <- 
  workflow() %>% 
  add_model(lr_model) %>% 
  add_recipe(bird_recipe)

bird_workflow

══ Workflow ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()

── Preprocessor ──────────────────────────────────────────────────────
1 Recipe Step

• step_dummy()

── Model ─────────────────────────────────────────────────────────────
Logistic Regression Model Specification (classification)

Computational engine: glm

# create cross-validation: to get better estimates of performance
set.seed(2021101002)
bird_folds <- vfold_cv(bird_train_data, v = 10, strata = bird_count) # since there is an imbalance

# fit resamples to workflow:
doParallel::registerDoParallel()
control_pred <- tune::control_resamples(save_pred = T)
lr_fit_rs <- fit_resamples(bird_workflow, bird_folds, control = control_pred)

# collect metrics

collect_metrics(lr_fit_rs) # acc = 0.966, auc = 0.858

# A tibble: 2 × 6
  .metric  .estimator  mean     n  std_err .config             
  <chr>    <chr>      <dbl> <int>    <dbl> <chr>               
1 accuracy binary     0.966    10 0.000338 Preprocessor1_Model1
2 roc_auc  binary     0.858    10 0.00307  Preprocessor1_Model1

rs_roc_plot <- augment(lr_fit_rs) %>% 
  clean_names() %>%  # so that pred_bird_seen can be used
  roc_curve(bird_seen, pred_bird_seen) %>% 
  autoplot()

rs_roc_plot

Adding interactions

bird_recipe_interact <- 
  bird_recipe %>% 
  step_interact(~starts_with("urban_rural"):starts_with("bird_type"))

bird_workflow_interact <- 
  workflow() %>% 
  add_model(lr_model) %>% 
  add_recipe(bird_recipe_interact)

lr_fit_rs_interact <- fit_resamples(bird_workflow_interact,
                                    bird_folds,
                                    control = control_pred)

collect_metrics(lr_fit_rs_interact) # 0.966 acc, 0.869 auc , slightly higher.

# A tibble: 2 × 6
  .metric  .estimator  mean     n  std_err .config             
  <chr>    <chr>      <dbl> <int>    <dbl> <chr>               
1 accuracy binary     0.966    10 0.000338 Preprocessor1_Model1
2 roc_auc  binary     0.869    10 0.00260  Preprocessor1_Model1

The AUC is higher for the interact model. However, this model took about 4 hours to run on my laptop….

Comparing AUC Curves

lr_mod_df <- augment(lr_fit_rs) %>% 
  clean_names() %>% 
  roc_curve(bird_seen, pred_bird_seen) %>% 
  mutate(model = "log reg model")


lr_interact_mod_df <- augment(lr_fit_rs_interact) %>% 
  clean_names() %>% 
  roc_curve(bird_seen, pred_bird_seen) %>% 
  mutate(model = "log reg model with interaction") 

bind_rows(lr_mod_df, lr_interact_mod_df) %>% 
  ggplot(aes(x = 1-specificity, y = sensitivity, col = model)) +
  geom_path(lwd = 1.5, alpha = 0.5) +
  geom_abline(lty = 3) +
  coord_equal() +
  scale_color_jco() +
  theme_classic()

Evaluating model on test data

bird_fit <- parsnip::fit(bird_workflow_interact, bird_test_data)

Predict the test set

predict(bird_fit, bird_test_data, type = "prob")

# A tibble: 40,222 × 2
   `.pred_bird seen` `.pred_not seen`
               <dbl>            <dbl>
 1     0.0133                   0.987
 2     0.00581                  0.994
 3     0.114                    0.886
 4     0.0629                   0.937
 5     0.00000000117            1.00 
 6     0.00000000117            1.00 
 7     0.195                    0.805
 8     0.0235                   0.976
 9     0.131                    0.869
10     0.0850                   0.915
# … with 40,212 more rows

Predicting on new data

# Creating new data frame
new_bird_data <- 
  tibble(bird_type = top_15_birds$bird_type) %>% 
  crossing(urban_rural = c("Urban", "Rural")) # all unique combinations

# Augment to new data

birds_predicted_new_df <- augment(bird_fit, new_bird_data) %>% 
  bind_cols(
    predict(bird_fit, new_bird_data, type = "conf_int")
  ) %>% 
  clean_names()

birds_predicted_new_df

# A tibble: 30 × 9
   bird_type         urban_rural pred_class pred_bird_seen pred_not_seen
   <fct>             <chr>       <fct>               <dbl>         <dbl>
 1 Australian Magpie Rural       not seen           0.205          0.795
 2 Australian Magpie Urban       not seen           0.254          0.746
 3 Crested Pigeon    Rural       not seen           0.110          0.890
 4 Crested Pigeon    Urban       not seen           0.168          0.832
 5 Crimson Rosella   Rural       not seen           0.231          0.769
 6 Crimson Rosella   Urban       not seen           0.131          0.869
 7 Eastern Spinebill Rural       not seen           0.321          0.679
 8 Eastern Spinebill Urban       not seen           0.114          0.886
 9 Grey Fantail      Rural       not seen           0.301          0.699
10 Grey Fantail      Urban       not seen           0.0909         0.909
# … with 20 more rows, and 4 more variables:
#   pred_lower_bird_seen <dbl>, pred_upper_bird_seen <dbl>,
#   pred_lower_not_seen <dbl>, pred_upper_not_seen <dbl>

birds_predicted_new_df %>% 
  filter(bird_type == "Spotted Dove")

# A tibble: 2 × 9
  bird_type    urban_rural pred_class pred_bird_seen pred_not_seen
  <fct>        <chr>       <fct>               <dbl>         <dbl>
1 Spotted Dove Rural       not seen    0.00000000117         1.00 
2 Spotted Dove Urban       not seen    0.180                 0.820
# … with 4 more variables: pred_lower_bird_seen <dbl>,
#   pred_upper_bird_seen <dbl>, pred_lower_not_seen <dbl>,
#   pred_upper_not_seen <dbl>

# visualizing

birds_predicted_new_df %>% 
  ggplot(aes(pred_bird_seen, bird_type, col = urban_rural)) +
  geom_errorbar(aes(
    xmin = pred_lower_bird_seen,
    xmax = pred_upper_bird_seen
  ), 
  width = 0.2, size = 1.2, alpha = 0.5) +
  geom_point(size = 3) +
  scale_x_continuous(labels = scales::percent) +
  labs(x = "Predicted probability of seeing bird",
       y = "",
       col = NULL,
       caption = "Caption: Cleary et al, 2016") +
  scale_colour_jco() +
  theme_classic()

The spotted dove is very hard to spot in rural places!

Reference:

[Julia Silge’s Youtube Video] [https://www.youtube.com/watch?v=NXot3Q0QtGk]
https://juliasilge.com/blog/bird-baths/

Comment on this article Share:

Tidy Tuesday Bird Bath Data